Abstract

When using the K-nearest neighbors method, one often ignores uncertainty in the
choice of K. To account for such uncertainty, Holmes and Adams (2002) proposed a
Bayesian framework for K-nearest neighbors (KNN). Their Bayesian KNN (BKNN)
approach uses a pseudo-likelihood function and standard Markov chain Monte Carlo
(MCMC) techniques to draw posterior samples. Holmes and Adams (2002) focused
on the performance of BKNN in terms of misclassification error but did not assess its
ability to quantify uncertainty. We present some evidence to show that BKNN still
significantly underestimates model uncertainty.
1 Introduction

The K-nearest neighbors method (e.g., Fix and Hodges 1951; Cover and Hart 1967) is
conceptually simple but flexible and useful in practice. It can be used for both regression and
classification. We focus on classification only.

Under the assumption that points close to one another should have similar responses, 
KNN classifies a new observation according to the class labels of its K nearest neighbors. 
In order to identify the neighbors, one must decide how to measure proximity among points 
and how to define the neighborhood. The most commonly used distance metric is the
Euclidean distance. The tuning parameter, K, is normally chosen by cross-validation. Figure
1 illustrates how KNN works. Suppose one takes K = 5. The possible predicted values are
{0/5, 1/5, ..., 5/5}. Among the five nearest neighbors of test point A, four
belong to class 0. Therefore, A is classified to class 0 with an estimated probability of 4/5.
Similarly, test point B is classified to class 1 with an estimated probability of 4/5. 
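
The classification rule just described can be sketched in a few lines. This is a minimal illustration only: the function name `knn_probability` and the toy coordinates are assumptions made for the example and are not the data shown in Figure 1.

```python
import numpy as np

def knn_probability(X_train, y_train, x_new, k=5):
    """Fraction of the k nearest training points (Euclidean distance) in class 1."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Hypothetical toy data: a class-0 cluster near (0, 0), a class-1 cluster near (1, 1),
# plus one point of the opposite class close to each cluster.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.3, 0.3],
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9], [0.9, 0.9], [0.7, 0.7]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

p_a = knn_probability(X_train, y_train, np.array([0.05, 0.05]))  # test point "A"
p_b = knn_probability(X_train, y_train, np.array([0.95, 0.95]))  # test point "B"
print(p_a, p_b)  # estimated Pr(class 1) at A and at B
```

In this toy setup four of the five nearest neighbors of A are in class 0, so the estimated probability of class 1 at A is 1/5, mirroring the situation described for Figure 1.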
Figure 1: Simulated example illustrating KNN with K = 5. Training observations from class 0
are indicated by the symbol "○", and those from class 1 are indicated by the symbol "•". A and
B are two test points.



Holmes and Adams (2002) pointed out that regular KNN does not account for the
uncertainty in the choice of K. They presented a Bayesian framework for KNN (BKNN), compared
its performance with the regular KNN on several benchmark data sets and concluded that 
BKNN outperformed KNN in terms of misclassification error. By model averaging over the 
posterior of K, BKNN is able to improve predictive performance. Unfortunately, they never 
assessed the inferential aspect of BKNN. In this paper, we present some evidence to show 
that, even though BKNN is designed to capture the uncertainty in the choice of K, it still 
significantly underestimates overall uncertainty. 



2 BKNN 

We first give a quick overview of BKNN in the context of a classification problem with
Q different classes. To cast KNN into a Bayesian framework, Holmes and Adams (2002)
adopted the following (pseudo) likelihood function for the data:

$$p(Y \mid X, \beta, K) = \prod_{i=1}^{n} \frac{\exp\{(\beta/K) \sum_{j \in N(x_i, K)} I(y_j = y_i)\}}{\sum_{q=1}^{Q} \exp\{(\beta/K) \sum_{j \in N(x_i, K)} I(y_j = q)\}}. \tag{1}$$

The indicator function $I$ is 1 whenever its argument is true, and the notation "$j \in N(x_i, K)$"
identifies the indices $j$ of the $K$ nearest neighbours of $x_i$. Thus $\sum_{j \in N(x_i, K)} I(y_j = y_i)$ is $K$
times the estimated probability from a conventional KNN model.

There are two unknown parameters, $K$ and $\beta$. The parameter $K$ is a positive
integer controlling the number of nearest neighbors, and $\beta$ is a positive continuous parameter
governing the strength of interaction between a data point and its neighbors.
The likelihood function (1) is a so-called pseudo-likelihood function (see, e.g., Besag 1974,
1975). Unlike regular likelihood functions, the component for data point $y_i$ depends on the
class labels of other data points $y_j$, for $j \neq i$. Treating $\beta$ and $K$ as random variables, the
marginal predictive distribution for a new data point $(x_{n+1}, y_{n+1})$ based on the training data
$(X, Y)$ is given by

$$p(y_{n+1} \mid x_{n+1}, X, Y) = \sum_{K} \int p(y_{n+1} \mid x_{n+1}, X, Y, \beta, K)\, p(\beta, K \mid X, Y)\, d\beta, \tag{2}$$

where

$$p(\beta, K \mid X, Y) \propto p(Y \mid X, \beta, K)\, p(\beta, K)$$

is the posterior distribution of $(\beta, K)$.

Except for the fact that $\beta$ should be positive, little prior knowledge is available on the
likely values of $K$ and $\beta$. Therefore, Holmes and Adams (2002) adopt independent,
non-informative prior densities,

$$p(\beta, K) = p(\beta)\, p(K),$$

where

$$p(K) = \mathrm{UNIF}[1, \ldots, K_{\max}] \text{ with } K_{\max} = n, \qquad p(\beta) = c\, I(\beta > 0),$$

and $c$ is a constant so that $p(\beta)$ is an improper flat prior on $\mathbb{R}^{+}$.

A random-walk Metropolis-Hastings algorithm is then used to draw $M$ samples from the
posterior $p(\beta, K \mid X, Y)$, so that (2) can be approximated by

$$p(y_{n+1} \mid x_{n+1}, X, Y) \approx \frac{1}{M} \sum_{j=1}^{M} p(y_{n+1} \mid x_{n+1}, X, Y, \beta^{(j)}, K^{(j)}), \tag{3}$$

where $(K^{(j)}, \beta^{(j)})$ is the $j$th sample from the posterior.
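
To make the sampling scheme concrete, the following is a rough Python sketch of a random-walk Metropolis-Hastings sampler for the pseudo-likelihood, followed by the Monte Carlo average over posterior draws. It is not the Matlab implementation of Holmes and Adams (2002): the function names, the proposal step sizes (`step_beta`, `step_k`), the starting values, the burn-in rule and the restriction of K to at most n - 1 (because a point is excluded from its own neighborhood) are all assumptions made for illustration.

```python
import numpy as np

def neighbor_order(X):
    """For each training point, indices of the other points sorted by Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :-1]   # the point itself sorts last; drop it

def log_pseudo_lik(beta, K, y, order, Q=2):
    """Log pseudo-likelihood for integer labels y in {0, ..., Q-1}."""
    nbr_labels = y[order[:, :K]]                                    # shape (n, K)
    counts = np.stack([(nbr_labels == q).sum(axis=1) for q in range(Q)], axis=1)
    eta = (beta / K) * counts                                       # shape (n, Q)
    log_norm = np.logaddexp.reduce(eta, axis=1)                     # stable log-sum-exp
    return float(np.sum(eta[np.arange(len(y)), y] - log_norm))

def bknn_mh(X, y, n_iter=2000, step_beta=0.5, step_k=3, seed=0):
    """Random-walk MH on (beta, K) with flat priors on beta > 0 and K in {1, ..., n-1}."""
    rng = np.random.default_rng(seed)
    order = neighbor_order(X)
    k_max = len(y) - 1                     # assumption: self excluded, so at most n-1 neighbors
    beta, K = 1.0, min(5, k_max)           # arbitrary starting values
    ll = log_pseudo_lik(beta, K, y, order)
    samples = []
    for _ in range(n_iter):
        beta_new = beta + step_beta * rng.normal()            # symmetric proposal for beta
        K_new = K + int(rng.integers(-step_k, step_k + 1))    # symmetric proposal for K
        # With flat priors, a proposal outside the support is rejected outright;
        # otherwise the acceptance ratio is the ratio of pseudo-likelihoods.
        if beta_new > 0 and 1 <= K_new <= k_max:
            ll_new = log_pseudo_lik(beta_new, K_new, y, order)
            if np.log(rng.uniform()) < ll_new - ll:
                beta, K, ll = beta_new, K_new, ll_new
        samples.append((beta, K))
    return samples

def predict_prob(x_new, X, y, samples, Q=2):
    """Average the class probabilities at x_new over the posterior draws."""
    idx = np.argsort(np.linalg.norm(X - x_new, axis=1))
    probs = np.zeros(Q)
    for beta, K in samples:
        counts = np.bincount(y[idx[:K]], minlength=Q)
        eta = (beta / K) * counts
        probs += np.exp(eta - np.logaddexp.reduce(eta))
    return probs / len(samples)

# Hypothetical, well-separated toy data (not the data used in the experiments).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)), rng.normal(5.0, 0.5, size=(30, 2))])
y = np.repeat([0, 1], 30)
draws = bknn_mh(X, y, n_iter=500)[250:]    # discard the first half as burn-in
probs = predict_prob(np.array([5.0, 5.0]), X, y, draws)
print(probs)
```

Because both proposals are symmetric, the acceptance probability reduces to a ratio of pseudo-likelihoods; a different proposal scheme would need the corresponding Hastings correction.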



3 Experiments and results 

One might believe that the Bayesian formulation will automatically account for model un- 
certainty, and that this is a major advantage of BKNN over regular KNN. We now describe 
a simple experiment that shows BKNN still significantly underestimates model uncertainty. 

The same experiment is repeated 100 times. Each time, we first generate n = 250 pairs of
training data from a known, two-class model (details in Section 3.1). We then fit BKNN and
regular KNN on the training data, and let them make predictions at a set of 160 pre-selected
test points (details in Section 3.2). For each test point, say $(x_{n+1}, y_{n+1})$, our parameter of
interest is

$$\theta_{n+1} = \Pr(y_{n+1} = 1 \mid x_{n+1}). \tag{4}$$

We construct both point estimates (Section 3.3) and interval estimates (Section 3.4) of $\theta_{n+1}$,
denoted $\hat{\theta}_{n+1}$ and $\hat{I}_{n+1}$, respectively.

To fit BKNN, we use the Matlab code provided by Holmes and Adams (2002) and exactly
the same MCMC setting as described in Holmes and Adams (2002, Section 3.1). To fit
regular KNN, we use the knn function in R.
Figure 2: (a) Training data from one experiment, and the true probability contour, $\Pr(y = 1 \mid x)$,
as given by (5). (b) The fixed set of test points, and the true probability contour.

3.1 Simulation Model 



Holmes and Adams (2002, Section 3.1) illustrated BKNN with a synthetic dataset consisting
of 250 training and 1000 test points, taken from http://www.stats.ox.ac.uk/pub/PRNN.
These data were originally generated from two classes, each being an equal mixture of two
bivariate normal distributions. In order to be able to generate slightly different training data
every time we repeat our experiment, we imitate this synthetic data set by assuming the
underlying distributions of class 1 ($C_1$) and class 0 ($C_0$) to be:

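$$x \mid C_1 \sim f_1(x) = 0.5\,\mathrm{BVN}(\mu_{11}, \Sigma) + 0.5\,\mathrm{BVN}(\mu_{12}, \Sigma),$$
$$x \mid C_0 \sim f_0(x) = 0.5\,\mathrm{BVN}(\mu_{01}, \Sigma) + 0.5\,\mathrm{BVN}(\mu_{02}, \Sigma),$$

with

$$\mu_{11} = \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}, \quad \mu_{12} = \begin{pmatrix} 0.4 \\ 0.7 \end{pmatrix}, \quad \mu_{01} = \begin{pmatrix} -0.7 \\ 0.3 \end{pmatrix}, \quad \mu_{02} = \begin{pmatrix} 0.3 \\ 0.3 \end{pmatrix},$$

and

$$\Sigma = \begin{pmatrix} 0.03 & 0 \\ 0 & 0.03 \end{pmatrix}.$$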
The prior class probabilities are taken to be equal, i.e., $\Pr(y = 1) = \Pr(y = 0) = 0.5$. Given
any data point $x$, its posterior probability of being in $C_1$ can be calculated by Bayes' rule:

$$\Pr(y = 1 \mid x) = \frac{f_1(x)}{f_1(x) + f_0(x)}. \tag{5}$$

Figure 2(a) shows the training data from one experiment and the true decision boundary.
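
As an illustration of how the true probability and the training data could be computed under this simulation model, here is a small Python sketch. The function names are hypothetical, and the numerical means and the common variance are our reading of the parameters stated in this section, so they should be treated as assumptions rather than as verified values.

```python
import numpy as np

SIGMA2 = 0.03                                   # assumed common variance (Sigma = 0.03 * I)
MU1 = np.array([[0.3, 0.7], [0.4, 0.7]])        # assumed component means for class 1
MU0 = np.array([[-0.7, 0.3], [0.3, 0.3]])       # assumed component means for class 0

def mixture_density(x, means, sigma2=SIGMA2):
    """Equal-weight mixture of two bivariate normals with covariance sigma2 * I."""
    x = np.atleast_2d(x)
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)   # shape (m, 2)
    comp = np.exp(-sq_dist / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)
    return comp.mean(axis=1)

def true_prob_class1(x):
    """Bayes' rule with equal prior class probabilities: f1 / (f1 + f0)."""
    f1 = mixture_density(x, MU1)
    f0 = mixture_density(x, MU0)
    return f1 / (f1 + f0)

def simulate(n, rng):
    """Draw n (x, y) pairs: class, then mixture component, then a normal deviate."""
    y = rng.integers(0, 2, size=n)
    comp = rng.integers(0, 2, size=n)
    means = np.where(y[:, None] == 1, MU1[comp], MU0[comp])
    X = means + np.sqrt(SIGMA2) * rng.normal(size=(n, 2))
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = simulate(250, rng)
p_near_c1 = true_prob_class1(np.array([0.4, 0.7]))[0]
p_near_c0 = true_prob_class1(np.array([-0.7, 0.3]))[0]
print(p_near_c1, p_near_c0)
```

Calling `simulate` with a fresh random state each time gives the "slightly different training data" needed for repeated experiments.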
3.2 Test points

Instead of focusing on the total misclassification error, we focus on predictions made at a fixed 
set of test points. These test points are chosen as follows: first, we lay out a grid along the 
first coordinate, $x_1 \in \{-1, -0.9, -0.8, \ldots, 0.8, 0.9\}$; for each $x_1$ in that grid, eight different
values of $x_2$ are chosen so that the test points together "cover" the critical part of the true
posterior probability contour. A total of 160 test points are obtained this way. Figure 2(b)
shows the fixed set of test points and the true posterior probability contour, $\Pr(y = 1 \mid x)$, as
given by (5).

In what follows, we refer to $\theta_{n+1}$ as the key parameter of interest, but it should be
understood that the subscript "n + 1" is used to refer to any test point. There are altogether 
160 such test points, and exactly the same calculations are performed for all of them, not 
just one of them. 

3.3 Point estimates of $\theta_{n+1}$

For BKNN, the point estimate of $\theta_{n+1} = \Pr(y_{n+1} = 1 \mid x_{n+1})$ is the posterior mean:

$$\hat{\theta}^{\mathrm{BKNN}}_{n+1} = \frac{1}{M} \sum_{j=1}^{M} p(y_{n+1} = 1 \mid x_{n+1}, X, Y, \beta^{(j)}, K^{(j)}),$$

where $(K^{(j)}, \beta^{(j)})$ are samples drawn from the posterior distribution, $p(K, \beta \mid X, Y)$. For
regular KNN, one chooses the parameter $K$ by cross-validation, and normally uses the original
KNN score

$$\frac{1}{K} \sum_{j \in N(x_{n+1}, K)} I(y_j = 1)$$

as the point estimate. In order to make things fully comparable, however, we further
transform the KNN scores by a logistic model fitted using the training data. We describe this
next.

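Notice that, for binary classification problems, i.e., $Q = 2$, each multiplicative term in (1)
can be rewritten as

$$\begin{aligned}
p(y_i \mid x_i, \beta, K) &= \frac{\exp\{(\beta/K) \sum_{j \in N(x_i, K)} I(y_j = y_i)\}}{\exp\{(\beta/K) \sum_{j \in N(x_i, K)} I(y_j = y_i)\} + \exp\{(\beta/K) \sum_{j \in N(x_i, K)} I(y_j \neq y_i)\}} \\
&\overset{(\dagger)}{=} \frac{\exp\{(\beta/K)[2 \sum_{j \in N(x_i, K)} I(y_j = y_i) - K]\}}{1 + \exp\{(\beta/K)[2 \sum_{j \in N(x_i, K)} I(y_j = y_i) - K]\}} \\
&= \frac{\exp\{\beta[2 g(y_i) - 1]\}}{1 + \exp\{\beta[2 g(y_i) - 1]\}},
\end{aligned} \tag{6}$$

where

$$g(y_i) = \frac{1}{K} \sum_{j \in N(x_i, K)} I(y_j = y_i) \tag{7}$$

is the output of KNN. The step labelled $(\dagger)$ in (6) is due to the identity

$$\sum_{j \in N(x_i, K)} I(y_j \neq y_i) = K - \sum_{j \in N(x_i, K)} I(y_j = y_i).$$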
Figure 3: Average of 100 point estimates versus the true parameter value, for all 160 test points.
A 45-degree reference line going through the origin is also displayed.

Notice that (6) is equivalent to running a logistic regression with no intercept and $[2g(y_i) - 1]$
as the only covariate. Since this extra transformation is built into BKNN, we use

$$\hat{\theta}^{\mathrm{KNN}}_{n+1} = \frac{\exp\{\hat{\beta}[2 g_{n+1} - 1]\}}{1 + \exp\{\hat{\beta}[2 g_{n+1} - 1]\}} \tag{8}$$

as the point estimate of regular KNN in order to be fully comparable with BKNN. In (8),
$g_{n+1}$ denotes the KNN score at $x_{n+1}$ (the fraction of its $K$ nearest neighbors in class 1), and
$\hat{\beta}$ is obtained by running a logistic regression of $y_i$ onto $[2g(y_i) - 1]$ with no intercept using
the training data.
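
The logistic transformation of the KNN score can be sketched as follows. This is only an illustration under stated assumptions: the one-parameter, no-intercept logistic regression is fitted here by a hand-written Newton iteration (any standard logistic regression routine would do), the covariate `z` stands for the class-1 neighbor fraction rescaled to [-1, 1], and the function names and toy values are hypothetical.

```python
import numpy as np

def fit_beta(z, y, n_iter=50, tol=1e-10):
    """No-intercept logistic regression of y on the single covariate z (Newton's method)."""
    beta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-beta * z))
        grad = np.sum(z * (y - p))               # score function
        info = np.sum(z ** 2 * p * (1.0 - p))    # observed information
        if info <= 0:
            break
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta

def calibrated_knn(g_new, beta):
    """Logistic transform of a raw KNN score g_new in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-beta * (2.0 * g_new - 1.0)))

# Hypothetical training covariates z_i = 2 * (class-1 neighbor fraction at x_i) - 1.
z = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=float)
beta_hat = fit_beta(z, y)
theta_hat = calibrated_knn(0.8, beta_hat)   # transformed estimate for a raw score of 4/5
print(beta_hat, theta_hat)
```

Note that the iteration would diverge if the training labels were perfectly separated by the sign of the covariate; the toy data above are chosen to avoid that case.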

After repeating the experiment 100 times, we obtain 100 slightly different point estimates
at each $x_{n+1}$. Figure 3 plots the average of these 100 point estimates against the true value
for all 160 test points. We see that both BKNN and regular KNN give very similar point
estimates.

3.4 Interval estimates of $\theta_{n+1}$

The main focus of our experiments is interval estimation. In particular, we are interested in 
the question of whether these interval estimates adequately capture model uncertainty. 
Figure 4: Estimated coverage probabilities of (a) $\hat{I}^{\mathrm{BKNN}}_{n+1}$ and (b) $\hat{I}^{\mathrm{KNN}}_{n+1}$, for all 160 test points.
Panel (a): BKNN. Panel (b): KNN with Bootstrap.

For BKNN, we use the 95% posterior (or credible) interval as our interval estimate, $\hat{I}^{\mathrm{BKNN}}_{n+1}$.
This is constructed by finding the 2.5th and 97.5th percentiles of the posterior samples. To
obtain an interval estimate for regular KNN, $\hat{I}^{\mathrm{KNN}}_{n+1}$, we resort to Efron's bootstrap. Given a
training set, $D$, we generate 500 bootstrap samples, $D^*_1, D^*_2, \ldots, D^*_{500}$, and repeat the entire
KNN model building process (that is, choosing $K$ by cross-validation and calculating
$\hat{\theta}^{\mathrm{KNN}}_{n+1,b}$ according to (8)) for every $b = 1, 2, \ldots, 500$. The interval estimate of $\theta_{n+1}$ is
constructed by taking the 2.5th and 97.5th percentiles of the set $\{\hat{\theta}_{n+1,1}, \ldots, \hat{\theta}_{n+1,500}\}$.
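
A simplified Python sketch of the bootstrap procedure is given below. It makes several assumptions that differ from the actual experiments: K is re-selected on each bootstrap sample by leave-one-out cross-validation over a small hypothetical grid `k_grid`, the raw KNN score is used as the estimate (the logistic transformation is omitted for brevity), and the toy data and function names are invented for the example.

```python
import numpy as np

def choose_k(X, y, k_grid):
    """Pick K from k_grid by leave-one-out misclassification error."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # leave each point out of its own neighborhood
    order = np.argsort(d, axis=1)
    errors = [np.mean((y[order[:, :k]].mean(axis=1) > 0.5) != (y == 1)) for k in k_grid]
    return k_grid[int(np.argmin(errors))]

def knn_score(X, y, x_new, k):
    """Fraction of the k nearest neighbors of x_new that are in class 1."""
    idx = np.argsort(np.linalg.norm(X - x_new, axis=1))[:k]
    return y[idx].mean()

def bootstrap_interval(X, y, x_new, k_grid=(1, 3, 5, 7, 9), n_boot=500, seed=0):
    """Percentile bootstrap interval, repeating the whole model-building step per resample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    est = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # resample training pairs with replacement
        Xb, yb = X[idx], y[idx]
        k = choose_k(Xb, yb, k_grid)             # K is re-chosen on every bootstrap sample
        est[b] = knn_score(Xb, yb, x_new, k)
    return np.percentile(est, [2.5, 97.5])

# Hypothetical overlapping two-class data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)), rng.normal(1.5, 1.0, size=(20, 2))])
y = np.repeat([0, 1], 20)
lo, hi = bootstrap_interval(X, y, np.array([0.75, 0.75]), n_boot=50)
print(lo, hi)
```

Re-selecting K inside the loop is the point of the exercise: it lets the interval reflect the variability of the tuning step as well as that of the neighbors themselves.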



3.4.1 Coverage probabilities 

Our first question of interest is: What are the coverage probabilities of $\hat{I}^{\mathrm{KNN}}_{n+1}$ and $\hat{I}^{\mathrm{BKNN}}_{n+1}$?
After repeating the experiment 100 times, we obtain 100 slightly different interval estimates
at each $x_{n+1}$. The coverage probability of $\hat{I}^{\mathrm{BKNN}}_{n+1}$ (and that of $\hat{I}^{\mathrm{KNN}}_{n+1}$) can be estimated
easily by counting the number of times $\theta_{n+1}$ is included in the interval over the 100
experiments. Histograms of the estimated coverage probabilities for all 160 test points are shown
in Figure 4. The posterior intervals produced by BKNN can easily be seen to have fairly
poor coverage overall.
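
The counting step can be written compactly. In this sketch the array layout (one row per repeated experiment, one column per test point) and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def coverage(lower, upper, theta_true):
    """Per-test-point fraction of experiments whose interval contains the true value.

    lower, upper: arrays of shape (n_experiments, n_test_points)
    theta_true:   array of shape (n_test_points,)
    """
    inside = (lower <= theta_true) & (theta_true <= upper)
    return inside.mean(axis=0)

# Hypothetical intervals from 4 experiments at 2 test points.
lower = np.array([[0.1, 0.5], [0.2, 0.6], [0.3, 0.4], [0.0, 0.7]])
upper = lower + 0.2
theta_true = np.array([0.25, 0.55])
cov = coverage(lower, upper, theta_true)
print(cov)
```

With 100 experiments and 160 test points, the same function would return the 160 estimated coverage probabilities that are summarized in a histogram.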



3.4.2 Lengths 

For each interval estimate, we can also calculate its length, e.g.,

$$\mathrm{length}^{\mathrm{BKNN}}_{n+1} = \hat{\theta}^{\mathrm{BKNN},97.5\%}_{n+1} - \hat{\theta}^{\mathrm{BKNN},2.5\%}_{n+1},$$
$$\mathrm{length}^{\mathrm{KNN}}_{n+1} = \hat{\theta}^{\mathrm{KNN},97.5\%}_{n+1} - \hat{\theta}^{\mathrm{KNN},2.5\%}_{n+1}.$$

Let

$$\overline{\mathrm{length}}^{\mathrm{BKNN}}_{n+1} \quad \text{and} \quad \overline{\mathrm{length}}^{\mathrm{KNN}}_{n+1}$$

be the average lengths of these 100 interval estimates. Our second question of interest is:
Are they too long, too short, or just right? In order to answer this question, we need a "gold
standard".

Figure 5: Schematic illustration of our assessment protocol. Variation over 100 point estimates is
used as a benchmark to assess the quality of the corresponding interval estimates.

The very reason for using these interval estimates is to reflect that there is uncertainty in
our estimate of the underlying parameter, $\theta_{n+1}$. This uncertainty is easy to assess directly
when one can repeatedly generate different sets of training data and repeatedly estimate the
parameter, which is exactly what we have done. The standard deviations of the 100 point
estimates (Section 3.3), which we write as

$$\mathrm{std}(\hat{\theta}^{\mathrm{BKNN}}_{n+1}) \quad \text{and} \quad \mathrm{std}(\hat{\theta}^{\mathrm{KNN}}_{n+1}),$$

give us a direct assessment of this uncertainty.

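If the point estimates, $\hat{\theta}^{\mathrm{BKNN}}_{n+1}$ and $\hat{\theta}^{\mathrm{KNN}}_{n+1}$, are approximately normally distributed, then
the correct lengths of the corresponding interval estimates should be roughly 4 times the
aforementioned standard deviation, that is,

$$\overline{\mathrm{length}}^{\mathrm{BKNN}}_{n+1} \approx 4 \times \mathrm{std}(\hat{\theta}^{\mathrm{BKNN}}_{n+1}), \tag{9}$$
$$\overline{\mathrm{length}}^{\mathrm{KNN}}_{n+1} \approx 4 \times \mathrm{std}(\hat{\theta}^{\mathrm{KNN}}_{n+1}). \tag{10}$$

We use (9)-(10) as heuristic guidelines to assess how well the interval estimates can capture
model uncertainty, despite lack of formal justification for the normal approximation. Figure 5
provides a schematic illustration of our assessment protocol.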
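
The factor of 4 follows from the normal approximation in one step. Writing $\sigma$ for the standard deviation of a point estimate and $z_{0.975} \approx 1.96$ for the 97.5th percentile of the standard normal distribution (both symbols are introduced here only for illustration), a central 95% interval has endpoints at the mean $\pm z_{0.975}\,\sigma$, so its length is

$$2\, z_{0.975}\, \sigma \approx 2 \times 1.96\, \sigma = 3.92\, \sigma \approx 4\, \sigma.$$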

Figure 6 plots the average lengths of these 100 interval estimates against 4 times the
standard deviations of the corresponding point estimates (that is, $\overline{\mathrm{length}}^{\mathrm{BKNN}}_{n+1}$ against
$4 \times \mathrm{std}(\hat{\theta}^{\mathrm{BKNN}}_{n+1})$ and $\overline{\mathrm{length}}^{\mathrm{KNN}}_{n+1}$ against $4 \times \mathrm{std}(\hat{\theta}^{\mathrm{KNN}}_{n+1})$) for all 160 test points. Here,
it is easy to see that the Bayesian posterior intervals are apparently too short, whereas
bootstrapping regular KNN gives a more accurate assessment of the amount of uncertainty
in the point estimate.

Figure 6: Average length of 100 interval estimates versus 4 times the standard deviation of the
corresponding point estimate, for all 160 test points. Two reference lines, both going through the
origin, one with slope 1 and another with slope 1/2, are also displayed.



4 Discussion 



Why does BKNN underestimate uncertainty? We believe it is because BKNN only accounts 
for the uncertainty in the number of neighbors (i.e., the parameter K), but it is unable to 
account for the uncertainty in the spatial locations of these neighbors. This is a general 
phenomenon associated with pseudo-likelihood functions.



Pseudo-likelihood functions were first introduced by Besag (1974, 1975) to model spatial
interactions in lattice systems. Since then, they have been widely used in image
processing (e.g., Besag 1986) and network tomography (e.g., Strauss and Ikeda 1990; Liang and Yu
2003; Robins et al. 2007). However, statistical inference based on pseudo-likelihood
functions is still in its infancy. Some researchers argue that pseudo-likelihood inference can be
problematic since it ignores at least part of the dependence structure in the data. In
applications to model social networks, a number of researchers, such as Wasserman and Robins
(2005) and Snijders (2002), have pointed out that maximum pseudo-likelihood estimates are
substantially biased and the standard errors of the parameters are generally underestimated.
For BKNN, the pseudo-likelihood function (1) clearly ignores the fact that the locations of
one's neighbors are also random, not just the number of neighbors.

However, for complex networks whose full likelihood functions are intractable, models
based on pseudo-likelihood are attractive (if not the only) alternatives (Strauss and Ikeda
1990). Rather than trying to write down the full likelihood functions for these difficult
problems, it is probably more fruitful to concentrate our research efforts on how to adjust
or correct standard error estimates produced by the pseudo-likelihood. To this effect, one
interesting observation from Figure 6 is the fact that

$$\overline{\mathrm{length}}^{\mathrm{BKNN}}_{n+1} \approx 2 \times \mathrm{std}(\hat{\theta}^{\mathrm{BKNN}}_{n+1}).$$

If we continue to use $4 \times \mathrm{std}(\hat{\theta}^{\mathrm{BKNN}}_{n+1})$ as the "gold standard", then these Bayesian posterior
intervals are about half as long as they should be. We have observed this phenomenon on
other examples, too, but do not yet have an explanation for it.

Despite the fact that BKNN seems to underestimate overall uncertainty, that $\overline{\mathrm{length}}^{\mathrm{BKNN}}_{n+1}$
is still approximately proportional to $\mathrm{std}(\hat{\theta}^{\mathrm{BKNN}}_{n+1})$ suggests that we can still rely on it to
assess the relative uncertainty of its predictions. For many practical problems, this is still
very useful. For example, if two accounts, A and B, are both predicted to be fraudulent with 
a high probability of 0.9 but the posterior interval of A is twice as long as that of B, then 
it is natural for a financial institution to spend its limited resources investigating account B 
rather than account A. 